-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4584] [yarn] Remove security manager from Yarn AM. #3484
Conversation
The security manager adds a lot of overhead to the runtime of the app, and causes a severe performance regression. Even stubbing out all unneeded methods (all except checkExit()) does not help. So, instead, penalize users who do an explicit System.exit() by leaving them in "undefined behavior" territory: if they do that, the Yarn backend won't be able to report the final app status to the RM. The result is that the final status of the application might not match the user's expectations. One side-effect of the change is that users who do an explicit System.exit() will lose the AM retry functionality. Since there is no way to know if the exit was because of success or failure, the AM right now errs on the side of it being a successful exit.
@tgravescs @andrewor14 @nishkamravi2 please take a look and let me know if you have questions. I tested successful and failed apps in yarn client and cluster modes, but didn't try explicit System.exit() (as we want to tell users to not do that). |
Test build #23901 has started for PR 3484 at commit
|
Hey @vanzin quick question which conditions listed in @tgravescs' comment here does this no longer satisfy? |
cluster UserClass: These will be actually "success" now. All others should remain the same. |
If this is behavior that we encourage users to avoid, I would think that defaulting to failure when System.exit is called would be preferable. Also, I'm guessing that user apps are more likely to call System.exit if they have an exit code? |
The main reason why I chose this approach is because if we do what you suggest, a successful app will actually be retried and will eventually fail. That sounds worse than not retrying a failed app. |
I see. The only thing this changes is the @vanzin @sryza If |
Actually the current code will fail the app (not succeed) regardless of what System.exit() says, because of the check on L108... so this actually does what Sandy suggests. If we want to not retry successful apps that exit with System.exit(), we'll need to change more code. So currently this does not work for this case in Tom's list: cluster UserClass: The app will be re-run and eventually fail after the retries. |
Wait, actually that's pretty bad right? If I run the same user code from 1.1 that calls |
Yeah, I'm not a fan of that behavior. Let me look at changing it, but also waiting to hear back from Tom. |
Ok, until we figure out a way to avoid that I retract my earlier LGTM because this is currently a regression in a different way. Thanks for digging into this. |
Test build #23901 has finished for PR 3484 at commit
|
Test PASSed. |
Ok, updated to match what was discussed. Tested with |
Test build #23906 has started for PR 3484 at commit
|
Test build #23906 has finished for PR 3484 at commit
|
Test PASSed. |
so I'm trying to understand exactly when this affects performance. I saw in the jira you saw its when you use java word count. So is this only an issue when things do a collect back to the ApplicationMaster or does it affect all driver performance? 2x seems huge for just having the securityManager there. SIGH.. thats unfortunate. We should basically revert back to the 1.1 behavior which is to mark things as succeeded even though they may have failed. |
Hey @tgravescs do you mind if I pull this in for a release candidate? I'm planning to cut one today. If there is some open discussion in this issue though I can defer it. |
Actually reading more in this thread I'll hold off until you or @andrewor14 signs off. |
// this shouldn't ever happen, but if it does assume weird failure | ||
finish(FinalApplicationStatus.FAILED, | ||
// This happens when the user application calls System.exit(). We have the choice | ||
// of either failing of succeeding at this point. We report success to avoid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - of should be or (failing or succeeding)
mostly looks good. I think we should fix the exit code before pulling it in. |
Okay I'd like to pull this in for rc1, so I'm going to merge this with the edits that Tom suggested. Feel free to open up additional patches. |
The security manager adds a lot of overhead to the runtime of the app, and causes a severe performance regression. Even stubbing out all unneeded methods (all except checkExit()) does not help. So, instead, penalize users who do an explicit System.exit() by leaving them in "undefined behavior" territory: if they do that, the Yarn backend won't be able to report the final app status to the RM. The result is that the final status of the application might not match the user's expectations. One side-effect of the change is that users who do an explicit System.exit() will lose the AM retry functionality. Since there is no way to know if the exit was because of success or failure, the AM right now errs on the side of it being a successful exit. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3484 from vanzin/SPARK-4584 and squashes the following commits: 21f2502 [Marcelo Vanzin] Do not retry apps that use System.exit(). 4198b3b [Marcelo Vanzin] [SPARK-4584] [yarn] Remove security manager from Yarn AM. (cherry picked from commit 915f8ee) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
@vanzin I would still be curious if you have more details on the exact performance impact? |
@tgravescs Nishkam (who filed the bug) has most of the background info on the performance impact. But, from a high level, installing a security manager seems to introduce a non-trivial performance penatly into any application. Even when the methods of the security manager are no-ops, it seems just having to check them all the time adds a lot of overhead, since it's done even for operations that should be generally very fast like accessing fields via reflection. If you're really really curious we can add instrumentation to measure the performance later when we're not busy trying to get a release out. |
The security manager adds a lot of overhead to the runtime of the
app, and causes a severe performance regression. Even stubbing out
all unneeded methods (all except checkExit()) does not help.
So, instead, penalize users who do an explicit System.exit() by leaving
them in "undefined behavior" territory: if they do that, the Yarn
backend won't be able to report the final app status to the RM.
The result is that the final status of the application might not match
the user's expectations.
One side-effect of the change is that users who do an explicit
System.exit() will lose the AM retry functionality. Since there is
no way to know if the exit was because of success or failure, the
AM right now errs on the side of it being a successful exit.